1 Abstract

In this document we discuss mass spectrometry (MS) data handling and exploration using the MsExperiment and Spectra Bioconductor packages and perform the preprocessing of a small liquid chromatography (LC)-MS data set xcms package. In addition we use functionality from the MetaboCoreUtils and MsCoreUtils package for general tasks frequently performed in metabolomics data analysis. The preprocessing of the LC-MS data comprises chromatographic peak detection, sample alignment and peak correspondence. Particular emphasis is given on deriving and defining data-set dependent values for the most crucial parameters of popular preprocessing methods.

2 Introduction

Preprocessing is the first step in the analysis of untargeted LC-MS or gas chromatography (GC)-MS data. The aim of the preprocessing is the quantification of signals from ions measured in a sample, adjusting for any potential LC drifts between samples and the matching of the quantified signal across samples within an experiment. The resulting two-dimensional matrix with abundances of the so called LC-MS features in all samples can then be further processed, e.g. by normalizing the data to remove differences due to sample processing, batch effects or injection order-dependent signal drifts. After preprocessing, the LC-MS features usually only characterized by their mass-to-charge ration (m/z) and retention time, have to be annotated to the actual ions and metabolites they represent. Data normalization and annotation are not covered in this document.

2.1 Prerequisites

The analysis in this document requires an R version >= 4.3.0 and recent versions of the MsExperiment and Spectra Bioconductor packages as well as the developmental version of the xcms package. These can be installed using the code below.

#' Install the Bioconductor package manager
install.packages("BiocManager")

#' Install the required packages
BiocManager::install(c("msdata",
                       "Spectra",
                       "MsExperiment",
                       "MetaboCoreUtils",
                       "MsCoreUtils",
                       "png"))
BiocManager::install("sneumann/xcms")

The xcms-preprocessing.Rmd file with all the code for the analysis can be downloaded from [github] (https://github.com/jorainer/metabolomics2018) ideally with git clone https://github.com/jorainer/metabolomics2018 from the command line.

2.2 Mass spectrometry

Mass spectrometry allows to measure abundances of charged molecules (ions) in a sample. Abundances are determined as ion counts for a specific mass-to-charge ratio m/z. The measured signal is represented as a spectrum: intensities along m/z.

Many ions will result, when measured with MS alone, in a very similar m/z making it difficult or impossible to discriminate them. MS is thus frequently coupled with a second technology to separate them prior quantification based on properties other than their mass (e.g. based on their polarity). Common choices are gas chromatography (GC) or liquid chromatography (LC). In a typical LC-MS setup a samples gets injected into the system, molecules are separated in the LC column while the MS instruments continuously (at discrete time points) continues to measure all ions that get generated from the molecules eluting at different time points from the LC-column. Molecules get thus separated on two different dimensions, the retention time dimension (from the LC) and the mass-to-charge dimension (from the MS) making it easier to measure and identify molecules in more complex samples.

In such GC/LC-MS based untargeted metabolomics experiments the data is analyzed along the retention time dimension and chromatographic peaks (which are supposed to represent the signal from ions of a certain type of molecule) are quantified.

2.3 Definitions and common naming convention

Naming conventions and terms used in this document are:

  • chromatographic peak: peak containing the signal from an ion in retention time dimension (different from a mass peak that represents the signal along the m/z dimension within a spectrum).
  • chromatographic peak detection: process in which chromatographic peaks are identified within each file.
  • alignment: process that adjusts for retention time differences (i.e. possible signal drifts from the LC) between measurements/files.
  • correspondence: grouping of chromatographic peaks (presumably from the same ion) across files.
  • feature (or LC-MS features): entity representing signal from the same type of ion/molecule, characterized by its specific retention time and m/z. In xcms, features represent identified chromatographic peaks grouped across files/samples.

3 Workflow: exploring and analyzing LC-MS data with Spectra and xcms

This workflow describes the basic data handling (I/O) of mass spectrometry data using the MsExperiment and Spectra package, and the LC-MS data preprocessing using xcms. The first part of the workflow is focused on data import, access and visualization which is followed by the description of a simple data centroiding approach and finally the xcms-based LC-MS data preprocessing that comprises chromatographic peak detection, alignment and correspondence. The workflow does not cover data normalization procedures, compound identification and differential abundance analysis.

3.1 Data import and exploration

The example data set of this workflow consists of two files in mzML format with signals from pooled human serum samples measured with a ultra high performance liquid chromatography (UHPLC) system (Agilent 1290) coupled with a Q-TOF MS (TripleTOF 5600+ AB Sciex) instrument. Chromatographic separation was based on hydrophilic interaction liquid chromatography (HILIC) separating metabolites depending on their polarity. The setup thus allows to measure small polar compounds and hence metabolites from the main metabolic pathways. The input files contain all signals measured by the MS instrument (so called profile mode data). To reduce file sizes, the data set was restricted to an m/z range from 105 to 134 and retention times from 0 to 260 seconds.

In the code block below we first load all required libraries and define the location of the mzML files, which are part of the msdata R package. We also define a data.frame describing the samples/experiment and pass this to the readMsExperiment function that imports the data. Similar to the on-disk-mode described in (Gatto, Gibb, and Rainer 2020), the m/z and intensity values are not immediately loaded into the memory but only when required which enables also analyses of very large experiments.

library(xcms)
library(MsExperiment)
library(Spectra)

## Define the file names.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)

## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls),
                 injection_idx = c(1, 19),
                 sample = c("POOL_1", "POOL_2"),
                 group = "POOL")
data <- readMsExperiment(fls, sampleData = pd)
data
## Object of class MsExperiment 
##  Spectra: MS1 (1862) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 1862 element(s).

Note that for a real experiment it is suggested to define such a phenodata table as an e.g. tab delimited text file (or an xls sheet) that contains all of the raw data file names along with all the relevant sample information for each file as additional columns. Such a data table could then be imported into R using e.g. read.table or read_xlsx (from the readxl R package) and the file names could be passed to readMsExperiment with files = paste0(MZML_PATH, "/", pd$mzML_file) where MZML_PATH would be a variable specifying the directory containing the data (mzML) files, pd would be the imported phenodata table with a column named "mzML_file" containing the names of the individual raw files.

Next we set up parallel processing. This ensures that all required cores are registered and available from the beginning of the analysis. All data access and analysis functions of xcms on a per-file basis and will use this setup by default.

#' Set up parallel processing using 2 cores
if (.Platform$OS.type == "unix") {
    register(bpstart(MulticoreParam(2)))
} else {
    register(bpstart(SnowParam(2)))
}

The MS data of the experiment is now represented by a MsExperiment object. Phenotype information can be retrieved with the sampleData function from that object.

#' Access phenotype information
sampleData(data)
## DataFrame with 2 rows and 5 columns
##            file injection_idx      sample       group spectraOrigin
##     <character>     <numeric> <character> <character>   <character>
## 1 20171016_P...             1      POOL_1        POOL /usr/local...
## 2 20171016_P...            19      POOL_2        POOL /usr/local...

The MS data is stored as a Spectra object within the MsExperiment and can be accessed using the spectra function.

spectra(data) 
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## ...        ...       ...       ...
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML

This Spectra object represents the full LC-MS data of the experiment. Each element in this object is a spectrum (in one sample/file) with all information provided by the respective original data (mzML) file. Spectra are organized linearly, i.e. all spectra from all files are within the same Spectra object, one after the other.

3.2 Basic data access and visualization

The MS data of an experiment is stored as a Spectra object within the MsExperiment and this MsExperiment contains the experiment’s sample information and manages the linkage between samples and spectra. The length of an MsExperiment is defined by the number of samples within the object.

length(data)
## [1] 2

As detailed above, the spectra function can be used to access the MS data of the experiment. As an example we below calculate the retention time range of the full experiment. Retention times can be extracted with the rtime function from the Spectra object within data. To avoid nested function calls and hence improve the readability of the code, we use the R pipe operator |> that allows to concatenate consecutive calls in a more readable fashion.

spectra(data) |>
rtime() |>
range()
## [1]   0.275 259.757

The Spectra object returned by spectra, contains all spectra from all files (samples) of an experiment. To access MS data from only a single file (sample) we need to subset the MsExperiment object to that particular sample before extracting the spectra data with spectra. Below we subset the data to the second sample using the [ function, access the spectra from that sample, extract their retention times and calculate their range.

data[2] |>
spectra() |>
rtime() |>
range()
## [1]   0.275 259.752

We can also access a single spectrum from this second sample and visualize its data. We thus below extract the spectrum number 123 from the second sample and use the plotSpectra function to plot that spectrum.

sp <- spectra(data[2])[123]
plotSpectra(sp)

We can see several relatively large signals (peaks) in that spectrum and also a number of low intensity peaks. This spectrum represents a single scan of the LC-MS data run of the second sample. It’s retention time can be extracted with rtime, intensity and m/z values can be extracted with the intensity and mz functions.

rtime(sp)
## [1] 34.314
intensity(sp)
## NumericList of length 1
## [[1]] 0 282 0 141 0 0 141 0 141 0 141 0 ... 563 563 422 0 0 282 282 0 282 141 0
mz(sp)
## NumericList of length 1
## [[1]] 105.95354942709 105.955001209814 ... 133.105299625013 133.106926815539

Note that, even for a Spectra with data from a single spectrum, the intensity and mz functions return a NumericList with the intensity and m/z values.

To get a general overview of the full data we combine all spectra measured for one sample into a single spectrum by reporting the maximum intensity of peaks with highly similar m/z values across all spectra of a sample using the combineSpectra function from the Spectra package. Parameter f allows to define which spectra in the data should be combined into a single spectrum. Since we want to aggregate all spectra from one sample, we use f = fromFile(data) (fromFile returns the sample index for each individual spectrum in the data set). combineSpectra uses the combinePeaks function to aggregate signal for the defined groups of spectra. It provides a large variety of possibilities to combine spectra, but in our case we simply want to sum the signal of peaks with a very similar m/z and hence we use parameters intensityFun = max and ppm = 10. See also ?combinePeaks for the full set of parameters and aggregation options.

bps <- combineSpectra(spectra(data), f = fromFile(data),
                      intensityFun = max, ppm = 10)

plotSpectra(bps)
Base peak spectrum for each of the two samples.

Figure 1: Base peak spectrum for each of the two samples

In contrast to the single spectrum before, the base peak spectrum (BPS) above shows more peaks indicating presence of more ions in the samples for the m/z range.

In addition to a BPS we can also create a base peak chromatogram (BPC) aggregating peak intensities for each scan (spectrum) per sample. This BPC is orthogonal to the BPS and provides general information on each LC run of the experiment. We use below the chromatogram function to extract this data with the additional parameter aggregationFun = "max" to report the maximal intensity for each spectrum (and hence discrete retention time).

bpc <- chromatogram(data, aggregationFun = "max")
plot(bpc)
Base peak chromatogram for the two samples. Each line represents one sample.

Figure 2: Base peak chromatogram for the two samples
Each line represents one sample.

The BPC shows some noisy signal at the beginning of the chromatography, but also some distinct (chromatographic) peak signal. In addition, we can see slight drifts in retention time between the two samples.

Apart from getting a general overview of the data it is also possible to explore the data in more detail. To this end we next focus on a specific subset of the data were we expect signal for a compound that should be present in serum samples. We thus filter below the spectra data using the filterRt function extracting only spectra measured between 180 and 181 seconds.

sps <- spectra(data) |>
filterRt(c(180, 181))
sps
## MSn data (Spectra) with 6 spectra in a MsBackendMzR backend:
##       msLevel     rtime scanIndex
##     <integer> <numeric> <integer>
## 1           1   180.240       646
## 2           1   180.519       647
## ...       ...       ...       ...
## 5           1   180.514       647
## 6           1   180.793       648
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML
## Processing:
##  Filter: select retention time [180..181] on MS level(s) 1 [Thu May 25 08:48:42 2023]

For the present data set there were 6 spectra measured within this one second in the two samples. By extracting the data as a Spectra object we have however lost now the direct (inherent) association between spectra and samples of the experiment. We could extract the name of the original data file from which the data was imported (see example below) and use that to determine the originating sample, but that would involve additional R code.

basename(dataOrigin(sps))
## [1] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
## [3] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_3_105-134.mzML"
## [5] "20171016_POOL_POS_3_105-134.mzML" "20171016_POOL_POS_3_105-134.mzML"

Alternatively, we could use the filterRt function also directly on the MsExperiment which would subset the MsExperiment and hence the link between samples and spectra would remain intact. Note however that only few filter and subset functions are available for MsExperiment objects while a large variety of useful filters are available for Spectra.

#' subset the whole MsExperiment
data_sub <- filterRt(data, rt = c(180, 181))
#' extract spectra from the subset for the first sample
spectra(data_sub[1L])
## MSn data (Spectra) with 3 spectra in a MsBackendMzR backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1   180.240       646
## 2         1   180.519       647
## 3         1   180.798       648
##  ... 33 more variables/columns.
## 
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## Processing:
##  Filter: select retention time [180..181] on MS level(s) 1 [Thu May 25 08:48:42 2023]

For the present purpose it is however not important to keep the sample association intact and we thus proceed to plot the previously extracted spectra.

plotSpectra(sps)
MS1 spectra measured between 180 and 181 seconds

Figure 3: MS1 spectra measured between 180 and 181 seconds

We can immediately spot several mass peaks in the spectrum, with the largest one at an m/z of about 130 and the second largest at about 106, which could represent signal for an ion of Serine. Below we calculate the exact (monoisotopic) mass for serine from its chemical formula C3H7NO3 using the calculateMass function from the MetaboCoreUtils package.

library(MetaboCoreUtils)
mass_serine <- calculateMass("C3H7NO3")
mass_serine
##  C3H7NO3 
## 105.0426

The native serine molecule is however uncharged and can thus not be measured by mass spectrometry. In order to be detectable, molecules need to be first ionized before being injected in an MS instrument. While different ions can (and will) be generated for a molecule, one of the most commonly created ions (adducts) in positive polarity is the [M+H]+ ion (protonated ion). To calculate the m/z values for specific ions/adducts of molecules, we can use the mass2mz function, also from the MetaboCoreUtils package. Below we calculate the m/z for the [M+H]+ ion of serine providing the monoisotopic mass of that molecule and specifying the ion we are interested in. Also other types of adducts are supported. These could be listed with the adductNames function (adductNames() for all positively charged and adductNames("negative") for all negatively charge ions).

serine_mz <- mass2mz(mass_serine, "[M+H]+")
serine_mz
##           [M+H]+
## C3H7NO3 106.0499

The mass2mz function always returns a matrix with columns reporting the m/z for the requested adducts of the molecules (rows). Since we requested a single ion we reduce this matrix to a single numeric.

serine_mz <- serine_mz[1, 1]

We can now use this information to subset the MS data to the signal recorded for all ions with that particular m/z. We use again the chromatogram function and provide the m/z range of interest with the mz parameter.

serine_chr <- chromatogram(data, mz = serine_mz + c(-0.05, 0.05))
plot(serine_chr)
Ion trace for an ion of serine

Figure 4: Ion trace for an ion of serine

A strong signal is visible around a retention time of 180 seconds which very likely represents signal for the [M+H]+ ion of serine.

The object returned by the chromatogram function arranges the individual Chromatogram objects (one per sample) in a two-dimensional array, columns being samples (files) and rows data slices (i.e. m/z - rt ranges). This type of data representation is likely to be replaced in future with a more efficient and flexible data structure similar to Spectra. Data from the individual chromatograms can be accessed using the intensity and rtime functions.

#' get intensity and retention times for the chromatogram of the first
#' sample
ints <- intensity(serine_chr[1, 1])
head(ints) 
## [1]  NA 559 659 278 492  NA
rts <- rtime(serine_chr[1, 1])
head(rts)
## [1] 0.280 0.559 0.838 1.117 1.396 1.675

At last we further focus on the tentative signal of serine extracting the ion chromatogram restricting on a certain retention time range. While we could also pass the retention time and m/z range with parameters rt and mz to the chromatogram function we instead filter the whole experiment by retention time and m/z before calling chromatogram on the such created data subset.

data |>
filterRt(rt = c(175, 189)) |>
filterMz(mz = serine_mz + c(-0.05, 0.05)) |>
chromatogram() |>
plot()
Extracted ion chromatogram for serine.

Figure 5: Extracted ion chromatogram for serine

The area of such a chromatographic peak is supposed to be proportional to the amount of the corresponding ion in the respective sample and identification and quantification of such peaks is one of the goals of the LC-MS data preprocessing.

3.3 Centroiding of profile MS data

MS instruments allow to export data in profile or centroid mode. Profile data contains the signal for all discrete m/z values (and retention times) for which the instrument collected data (R. Smith et al. 2014). MS instruments continuously sample and record signals and a mass peak for a single ion in one spectrum will thus consist of a multiple intensities at discrete m/z values. Centroiding is the process to reduce these mass peaks to a single representative signal, the centroid. This results in much smaller file sizes, without loosing too much information. xcms, specifically the centWave chromatographic peak detection algorithm, was designed for centroided data, thus, prior to data analysis, profile data, such as the example data used here, should be centroided.

Below we inspect the profile data for the [M+H]+ ion adduct of Serine. We again subset the data to the m/z and retention time range containing signal from Serine and use the plot function on the subset MsExperiment to visualize the full MS data. This plot function should ideally only called on subsets of an MsExperiment but not on the full MS data. The plot visualizes the intensities of individual peaks (color coded) in the two-dimensional retention time against m/z space.

data |>
filterRt(rt = c(175, 189)) |>
filterMz(mz = c(106.02, 106.07)) |>
plot() 
Profile data for Serine.

Figure 6: Profile data for Serine

The plot shows all data points measured by the instrument. Each column of data points in the lower panel represents the signal measured at one discrete time point, stored in one spectrum. We can see a distribution of the signal for serine in both retention time and also in m/z dimension.

Next we smooth the data in each spectrum using a Savitzky-Golay filter, which usually improves data quality by reducing noise. Subsequently we perform the centroiding of the data based on a simple peak-picking strategy that reports the maximum signal for each mass peak in each spectrum and replace the spectra data in our data (MsExperiment) object.

#' Smooth and centroid the spectra data
sps_cent <- spectra(data) |>
smooth(method = "SavitzkyGolay", halfWindowSize = 6L) |>
pickPeaks(halfWindowSize = 2L)

#' replace spectra
spectra(data) <- sps_cent

#' Plot the centroided data for Serine
data |>
filterRt(rt = c(175, 189)) |>
filterMz(mz = c(106.02, 106.07)) |>
plot() 
Centroided data for Serine.

Figure 7: Centroided data for Serine

The centroiding reduced the data to a single data point for an ion in each spectrum. For more advanced centroiding options that can also fine-tune the m/z value of the reported centroid see the pickPeaks help or the centroiding vignette of the MSnbase package.

We next export the centroided MS data to files in mzML format and re-read the data set from these.

We use the export function for data export of the centroided Spectra object. Parameter backend allows to specify the MS data backend that should be used for the export, and that will also define the data format (use backend = MsBackendMzR() to export data in mzML format. Parameter file defines, for each spectrum, the name of the file to which its data should be exported.

export(spectra(data), backend = MsBackendMzR(),
       file = basename(dataOrigin(spectra(data))))

We next import the centroided data again from the newly generated mzML files.

fls <- basename(fls)

#' Read the centroided data.
data <- readMsExperiment(fls, sampleData = pd) 

3.4 Preprocessing of LC-MS data

Preprocessing of (untargeted) LC-MS data aims at detecting and quantifying the signal from ions generated from all molecules present in a sample. It consists of the 3 steps chromatographic peak detection, alignment (also called retention time correction) and correspondence (also called peak grouping). The resulting matrix of feature abundances can then be used as an input in downstream analyses including data normalization, identification of features of interest and annotation of features to metabolites.

3.4.1 Chromatographic peak detection

Chromatographic peak detection aims to identify peaks along the retention time axis that represent the signal from individual compounds’ ions. Such peak detection can be performed with the xcms package using its findChromPeaks function. Several peak detection algorithms are available that can be configured with its specific parameter object: MatchedFilterParam to perform peak detection as described in the original xcms article (C. A. Smith et al. 2006), CentWaveParam to perform a continuous wavelet transformation (CWT)-based peak detection (Tautenhahn, Böttcher, and Neumann 2008) and MassifquantParam to perform a Kalman filter-based peak detection (Conley et al. 2014). Additional peak detection algorithms for direct injection data are also available, but not discussed here.

In our example we use the centWave algorithm that performs peak detection in two steps: first it identifies regions of interest in the m/z - retention time space and subsequently detects peaks in these regions using a continuous wavelet transform (see the original publication for more details). The algorithm can be configured with several parameters (see ?CentWaveParam), with the most important being peakwidth and ppm. peakwidth defines the minimal and maximal expected width of the peak in retention time dimension and depends thus on the setup of the employed LC-MS system making this parameter highly data set dependent. Appropriate values can be estimated based on extracted ion chromatograms of e.g. internal standards or known compounds in the data. We thus below extract the chromatographic data for serine and perform a peak detection on that data subset using findChromPeaks and the default parameters for centWave.

#' Get the XIC for serine in all files
serine_chr <- chromatogram(data, rt = c(164, 200),
                           mz = serine_mz + c(-0.05, 0.05),
                           aggregationFun = "max")

#' Get default centWave parameters
cwp <- CentWaveParam()

#' "dry-run" peak detection on the XIC.
res <- findChromPeaks(serine_chr, param = cwp)
chromPeaks(res)
##      rt rtmin rtmax into intb maxo sn row column

With the default settings for centWave, findChromPeaks failed to detect any chromatographic peak in that data. These default values are shown below.

cwp
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 25
##  - peakwidth: [1] 20 50
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 0
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE

Particularly the settings for peakwidth does not fit our data. The default for this parameter expects chromatographic peaks between 20 and 50 seconds wide. When we plot the extracted ion chromatogram (XIC) for serine we can however see that these values are too large for the present data set (see below).

plot(serine_chr)
Extracted ion chromatogram for serine.

Figure 8: Extracted ion chromatogram for serine

In fact, the serine signal seems to be measured for around 5 seconds. We thus next adapt the settings to accommodate peaks ranging from 2 to 10 seconds and re-run the peak detection. In general, it is advised to investigate peak widths for several ions in the data set to determine the most appropriate peakwidth setting. In addition, we select a different peak boundary estimation algorithm by setting integrate = 2. This works particularly well for non-gaussian peak shapes and ensures that also signal from peak’s tail is integrated (eventually re-run the code with the default integrate = 1 to compare the two approaches).

cwp <- CentWaveParam(peakwidth = c(2, 10), integrate = 2)

serine_chr <- findChromPeaks(serine_chr, param = cwp)

#' Plot the data and higlight identified peak area
plot(serine_chr)
XIC for Serine with detected chromatographic peak

Figure 9: XIC for Serine with detected chromatographic peak

With our data set-specific peakwidth we were able to detect the peak for serine (highlighted in grey in the plot above). We can now use the chromPeaks function to extract the information on identified chromatographic peaks from our object.

chromPeaks(serine_chr)
##           rt   rtmin   rtmax     into     intb     maxo  sn row column
## [1,] 181.356 178.566 189.447 74443.95 71734.01 37664.94 110   1      1
## [2,] 181.072 178.561 187.210 70352.22 69008.98 38517.76 224   1      2

The result is returned as a matrix with the retention time and m/z range of the peak ("rtmin", "rtmax", "mzmin" and "mzmax") as well as the integrated peak area ("into"), the maximal signal ("maxo") and the signal to noise ratio ("sn").

Another important parameter for centWave is ppm which is used in the initial identification of the regions of interest. In contrast to random noise, the real signal from an ion is expected to yield stable m/z values in consecutive scans (the scattering of the m/z values around the real m/z value of the ion is supposed to be inversely related with its intensity - at least for TOF instruments). In centWave, all peaks with m/z values that differ by less than ppm in consecutive spectra are combined into a region of interest (ROI) that is then subjected to the CWT-based peak detection. To illustrate this, we plot the full MS data for the data subset containing signal for serine.

#' Restrict the data to signal from Serine
srn <- data |>
filterRt(rt = c(179, 186)) |>
filterMz(mz = c(106.04, 106.07))

#' Plot the data
plot(srn) 

We can observe some scattering of the data points in m/z dimension (lower panel in the plot above), that decreases with increasing intensity of the signal.

We next calculate the differences in m/z values between consecutive scans in this data subset. We do this for one representative file, selecting the one with the highest total sum of intensities in the selected region. We calculate the total intensity by summing all intensities of all spectra within the region for each of the two files (the first sum on the intensities returned by intensity sums all intensities within each spectrum, the second sum sums these values to result in the total sum).

#' split the experiment by sample
srn_sample <- split(srn, seq_along(srn))

#' apply a function to each experiment that
#' - extracts the spectra for that sample
#' - gets their intensities
#' - sums these intensities per spectrum
#' - finally sums these intensity sums across spectra
int_sums <- vapply(srn_sample,
                   function(z) sum(sum(intensity(spectra(z)))),
                   numeric(1))
int_sums
##        1        2 
## 264511.6 251648.7

We next extract the spectra from the sample with the highest intensity and evaluate the number of peaks we’ve got for the m/z - retention time range of our data subset.

srn_sps <- spectra(srn[which.max(int_sums)])

#' get the number of peaks for each spectrum
lengths(srn_sps)
##  [1] 1 2 1 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 1 1 2 2 2 2 2

For some spectra with have more than one peak in the MS data subset. For these we next select the peak with the highest intensity. Here we can make use of the powerful infrastructure of the Spectra package, that allows to apply any user-provided function to the peak matrix of each spectrum. We thus next define a simple function that takes a peak matrix as input, subsets that to the one row (i.e. peak) with the highest intensity and returns that single-row (peak) matrix again.

#' function to select the row with the highest intensity and return that
max_int <- function(x, ...) {
    x[which.max(x[, "intensity"]), , drop = FALSE]
}

We can now apply this function to the data using the addProcessing function. Any user provided function that take a peak matrix as first argument and that return a peak matrix can be passed to that addProcessing function. Note also that the function has to have the ... parameter in its function definition (even if no additional parameters are passed along). See also ?addProcessing for more information. Below we apply this function to the extracted spectra and determine the number of peaks after processing.

srn_sps <- addProcessing(srn_sps, max_int)
lengths(srn_sps)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We indeed have now a single peak per spectrum. We next extract the m/z values for these peaks and calculate the difference of m/z values between consecutive scans (spectra).

mzs <- unlist(mz(srn_sps))
abs(diff(mzs))
##           mz           mz           mz           mz           mz           mz 
## 1.452460e-03 2.904891e-03 1.179878e-04 1.452442e-03 0.000000e+00 1.684509e-05 
##           mz           mz           mz           mz           mz           mz 
## 0.000000e+00 0.000000e+00 7.233670e-05 0.000000e+00 0.000000e+00 7.624200e-07 
##           mz           mz           mz           mz           mz           mz 
## 1.452441e-03 1.452441e-03 1.358206e-03 0.000000e+00 0.000000e+00 1.425717e-03 
##           mz           mz           mz           mz           mz           mz 
## 0.000000e+00 1.452441e-03 1.480143e-03 0.000000e+00 0.000000e+00 1.493783e-03

We can also express these differences in ppm (parts per million) of the average m/z of the peaks.

abs(diff(mzs)) * 1e6 / mean(mzs)
##           mz           mz           mz           mz           mz           mz 
## 13.695973646 27.391665930  1.112565444 13.695804399  0.000000000  0.158840806 
##           mz           mz           mz           mz           mz           mz 
##  0.000000000  0.000000000  0.682098923  0.000000000  0.000000000  0.007189239 
##           mz           mz           mz           mz           mz           mz 
## 13.695795336 13.695795336 12.807200180  0.000000000  0.000000000 13.443799681 
##           mz           mz           mz           mz           mz           mz 
##  0.000000000 13.695795190 13.957010392  0.000000000  0.000000000 14.085629933

The difference in m/z values for the serine data is thus between 0 and 27 ppm. This should ideally be evaluated for several compounds and should be set to a value that allows to capture the full chromatographic peaks for most of the tested compounds. We can next perform the peak detection using our settings for the ppm and peakwidth parameters.

#' Perform peak detection
cwp <- CentWaveParam(peakwidth = c(2, 10), ppm = 30, integrate = 2)
data <- findChromPeaks(data, param = cwp) 

The findChromPeaks call adds the results from the chromatographic peak detection to our data set (which is now represented by an XcmsExperiment object that directly extends MsExperiment and hence inherits all of its functionality).

data
## Object of class XcmsExperiment 
##  Spectra: MS1 (1862) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 1862 element(s).
##  xcms results:
##   - chromatographic peaks: 653 in MS level(s): 1

The results from the peak detection analysis can be accessed with the chromPeaks function. The optional parameters rt and mz allow in addition to extract peak detection results for a specific m/z - retention time region:

#' Access the peak detection results from a specific m/z - rt area
chromPeaks(data, mz = c(106, 107), rt = c(150, 190)) 
##             mz    mzmin    mzmax      rt   rtmin   rtmax       into       intb
## CP133 106.0625 106.0606 106.0636 173.264 171.869 174.380   516.3588   510.5323
## CP167 106.0506 106.0505 106.0506 181.356 178.845 187.773 74181.7823 73916.8683
## CP475 106.0633 106.0609 106.0652 172.701 170.748 174.654   559.5491   553.2785
## CP516 106.0496 106.0494 106.0508 181.072 178.282 187.210 70373.6099 70106.7152
##             maxo  sn sample
## CP133   426.6084  38      1
## CP167 37664.9371 688      1
## CP475   381.6084  53      2
## CP516 38517.7622 826      2

For each identified peak the m/z and rt value of the apex is reported (columns "mz" and "rt") as well as their ranges ("mzmin", "mzmax", "rtmin", "rtmax"), the integrated signal of the peak (i.e. the peak area "into"), the maximal signal of the peak ("maxo"), the signal to noise ratio ("sn") and the index of the sample in which the peak was detected ("sample").

Note that after peak detection, extracted ion chromatogram data will also contain identified chromatographic peaks. Below we extract the EIC for Serine and display the identified peaks for that compound.

eic_serine <- chromatogram(data, mz = c(106.04, 106.06),
                           rt = c(179, 186))
chromPeaks(eic_serine)
##             mz    mzmin    mzmax      rt   rtmin   rtmax     into     intb
## CP167 106.0506 106.0505 106.0506 181.356 178.845 187.773 74181.78 73916.87
## CP516 106.0496 106.0494 106.0508 181.072 178.282 187.210 70373.61 70106.72
##           maxo  sn sample row column
## CP167 37664.94 688      1   1      1
## CP516 38517.76 826      2   1      2

And plotting this extracted ion chromatogram will also show the identified chromatographic peak(s).

plot(eic_serine)

This thus provides a convenient way to evaluate peak detection results for sets of m/z - retention time regions for potential known compounds.

Similarly, peak detection results will be visualized in the generic plot of the result object (chromatogrphic peaks are highlighed as a red rectangle; see below).

srn <- data |>
filterRt(rt = c(175, 188)) |>
filterMz(mz = c(106.04, 106.06))

plot(srn)

Peak detection will not always work perfectly leading to peak detection artifacts, such as overlapping peaks or artificially split peaks. The refineChromPeaks function allows to refine peak detection results by either removing peaks not meeting predefined properties or by merging artificially split chromatographic peaks (see ?refineChromPeaks for all options). Below we post-process the peak detection results merging peaks (within each sample) that overlap in a 4 second window if the signal between in between them is lower than 75% of the smaller peak’s largest intensity. See the MergeNeighboringPeaksParam help page for a detailed description of the settings and the approach.

mpp <- MergeNeighboringPeaksParam(expandRt = 4)
data_pp <- refineChromPeaks(data, param = mpp)

Merged peaks can be identified with a TRUE in the "merged" chromatographic peak metadata column that can be extracted with the chromPeakData function (see below).

chromPeakData(data_pp)
## DataFrame with 595 rows and 3 columns
##        ms_level is_filled    merged
##       <integer> <logical> <logical>
## CP001         1     FALSE     FALSE
## CP002         1     FALSE     FALSE
## ...         ...       ...       ...
## CP682         1     FALSE      TRUE
## CP683         1     FALSE      TRUE

An example for a joined (merged) peak is given below.

mzr <- c(122.9, 122.97)
rtr <- c(100, 150)

chr_1 <- chromatogram(data[1], mz = mzr, rt = rtr)
chr_2 <- chromatogram(data_pp[1], mz = mzr, rt = rtr)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
Result from the peak refinement. Left: peaks before merging, right after merging.

Figure 10: Result from the peak refinement
Left: peaks before merging, right after merging.

The signal is rather wide and noise and centWave thus was not able to integrate the whole peak and even defined partially overlapping chromatographic peaks (left in the image above). Peak refinement merged all the consecutive peaks as the signal between them never dropped below 75% of the smallest chromatographic peak’s intensity.

At last we replace the data variable with the object containing also the peak refinement results.

data <- data_pp

For quality assessment we could now calculate summary statistics on the identified peaks to e.g. identify samples with much less detected peaks. Also, we can use the plotChromPeaks function to provide some general information on the location of the identified chromatographic peaks in the m/z - rt space.

par(mfrow = c(1, 2))
plotChromPeaks(data, 1)
plotChromPeaks(data, 2) 
Location of the identified chromatographic peaks in the m/z - rt space.

Figure 11: Location of the identified chromatographic peaks in the m/z - rt space

3.4.2 Alignment

While chromatography helps to discriminate better between analytes it is also affected by variances that can lead to shifts in retention times between measurement runs. The alignment step aims to adjust these retention time differences between samples within an experiment. Below we plot the base peak chromatograms of both files of our data set to visualize these differences. Note that with peakType = "none" we disable plotting of identified chromatographic peaks that would be drawn by default on chromatograms extracted from an object containing peak detection results.

#' Extract base peak chromatograms
bpc_raw <- chromatogram(data, aggregationFun = "max")
plot(bpc_raw, peakType = "none")
BPC of all files.

Figure 12: BPC of all files

While both samples were measured with the same setup on the same day slight drifts of the signal are visible.

Alignment can be performed with xcms using the adjustRtime function that supports the peakGroups (C. A. Smith et al. 2006) and the obiwarp (Prince and Marcotte 2006) method. The settings for the algorithms can be defined with the PeakGroupsParam and the ObiwarpParam parameter objects, respectively (see ?adjustRtime for all available alignment methods and their settings). Note also that xcms supports a subset-based alignment which allows to align a large data set with the retention time drift estimated on a subset of the samples (ideally QC or pooled samples repeatedly measured over the whole measurement run(s)). See the xcms vignette for more information.

For our example we use the peakGroups method that aligns samples based on the retention times of hook peaks (or housekeeping peaks), which represent signal from ions expected to be present in most samples. To define these we need however to group detected chromatographic peaks across samples using an initial correspondence analysis. Below we use the peakDensity method for correspondence. Details about this method and explanations on the choices of its parameters are provided in the next section. After having performed this initial correspondence analysis, we perform the alignment using settings minFraction = 1 and span = 0.6. minFraction defines the proportion of samples in which a candidate hook peak has to be detected/present. A value of 0.9 would for example require the chromatographic peak for a specific ion (i.e. m/z - retention time) to be present (detected) in 90% of samples to consider it as a hook peak for the alignment. Our data represents replicated measurements of the same sample pool and we can therefore require hook peaks to be present in each file. The parameter span defines the degree of smoothing of the loess function that is used to allow different regions along the retention time axis to be adjusted by a different factor. A value of 0 will most likely cause overfitting, while 1 would perform a constant, linear shift. Values between 0.4 and 0.6 seem to be reasonable for most experiments.

#' Define the settings for the initial peak grouping - details for
#' choices in the next section.
pdp <- PeakDensityParam(sampleGroups = sampleData(data)$group, bw = 1.8,
                        minFraction = 1, binSize = 0.02)
data <- groupChromPeaks(data, pdp)

#' Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data <- adjustRtime(data, param = pgp) 

Adjusted retention times are stored, along with the raw retention times, within the result object. Any function accessing retention times (such as rtime) will by default return adjusted retention times from an XcmsExperiment object, if present. Note that also the retention times of the identified chromatographic peaks were adjusted by the adjustRtime call. After alignment it is suggested to evaluate alignment results e.g. by inspecting differences between raw and adjusted retention times.

#' Plot the difference between raw and adjusted retention times
plotAdjustedRtime(data)
Alignment results. Shown is the difference between raw and adjusted retention times and the hook peaks that were used for the alignment (shown as points).

Figure 13: Alignment results
Shown is the difference between raw and adjusted retention times and the hook peaks that were used for the alignment (shown as points).

The difference between raw and adjusted retention time should be reasonable. In our example it is mostly below one second, which is OK since the samples were measured within a short time period and differences are thus expected to be small. Also, hook peaks should ideally be present along the full retention time range to allow proper estimation of drifts along the full time of measurement.

Next we plot the base peak chromatograms before and after alignment. Note that a chromatogram call will always include all detected chromatographic peaks in the requested rt range but for a base peak chromatogram we don’t need this information. With chromPeaks = "none" in the chromatogram call we tell the function to not include any identified chromatographic peaks in the returned chromatographic data.

par(mfrow = c(2, 1))
#' Plot the raw base peak chromatogram
plot(bpc_raw, peakType = "none")
#' Plot the BPC after alignment
plot(chromatogram(data, aggregationFun = "max", chromPeaks = "none"))
BPC before (top) and after (bottom) alignment.

Figure 14: BPC before (top) and after (bottom) alignment

The base peak chromatograms are nicely aligned after retention time adjustment. The impact of the alignment should also be evaluated on known compounds or internal standards. We thus plot below the XIC for serine before and after alignment. To get raw retention times and hence be able to extract the ion chromatogram of serine before alignment we have to drop the alignment results which can be done with the dropAdjustedRtime function. This function will restore the retention times of spectra (and chromatographic peaks) before alignment.

data_raw <- dropAdjustedRtime(data)

We can thus extract next the ion chromatogram for serine from the raw data as well as after alignment.

#' Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 1, 0.5))
chromatogram(data_raw, mz = c(106.04, 106.06), rt = c(179, 186)) |>
plot()
chromatogram(data, mz = c(106.04, 106.06), rt = c(179, 186)) |>
plot()
XIC for Serine before (left) and after (right) alignment

Figure 15: XIC for Serine before (left) and after (right) alignment

The Serine peaks are also nicely aligned after adjustment.

3.4.3 Correspondence

The final step of the LC-MS preprocessing with xcms is the correspondence analysis, in which chromatographic peaks from the same types of ions (compounds) are grouped across samples to form a so called LC-MS feature. xcms implements two methods for this purpose: peak density (C. A. Smith et al. 2006) and nearest (Katajamaa, Miettinen, and Oresic 2006) that can be configured by passing either a PeakDensityParam or a NearestPeaksParam object to the groupChromPeaks function (see ?groupChromPeaks for more information). For our example we use the peak density method that iterates through m/z slices in the data and groups chromatographic peaks to features in each slice (within the same or across samples) depending on their retention time and the distribution of chromatographic peaks along the retention time axis. Peaks representing signal from the same ion are expected to have a similar retention time and, if found in many samples, this should also be reflected by a higher peak density at the respective retention time (since the retention times for this compound are expected to be similar in the different samples). To illustrate this we extract below an m/z slice containing the serine peak and use the plotChromPeakDensity function to visualize the distribution of peaks along the retention time axis and to simulate a correspondence analysis based on the provided settings.

#' Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = sampleData(data)$group)

#' Extract a BPC for the m/z slice containing serine
bpc_serine <- chromatogram(data, mz = c(106.04, 106.06),
                           aggregationFun = "max")

#' Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_serine, param = pdp)
BPC for a m/z slice and defined features within this slice based on default settings.

Figure 16: BPC for a m/z slice and defined features within this slice based on default settings

The upper panel in the plot above shows the chromatographic data with the identified peaks. The lower panel shows the retention time of identified peaks (x-axis) per sample (y-axis) with the black solid line representing their distribution (density) along the x-axis. Peak groups (features) are indicated with grey rectangles in the lower panel. The peak density correspondence method groups all chromatographic peaks under the same density peak into one feature. With the default settings we were able to group the serine peak of each sample into one feature. The parameters for the peak density correspondence analysis are:

  • binSize: m/z width of the bin/slice of data in which peaks are grouped.
  • bw defines the smoothness of the density function.
  • maxFeatures: maximum number of features to be defined in one bin.
  • minFraction: minimum proportion of samples (of one group!) for which a peak has to be present.
  • minSamples: minimum number of samples a peak has to be present.

The parameters minFraction and minSamples depend on the experimental layout and should be set accordingly. binSize should be set to a small enough value to avoid peaks from different ions, but with similar m/z and retention time, being grouped together. The most important parameter however is bw and, while its default value of 30 was able to correctly group the Serine peaks, it should always be evaluated on other, more complicated, signals too. Below we evaluate the performance of the default parameters on an m/z slice that contains signal from multiple ions with the same m/z, including isomers betaine and valine ([M+H]+ m/z 118.08625).

#' Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data, mz = mzr, aggregationFun = "max")

#' Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = sampleData(data)$group)
plotChromPeakDensity(chr, param = pdp)
Correspondence analysis with default settings on an m/z slice containing signal from multiple ions.

Figure 17: Correspondence analysis with default settings on an m/z slice containing signal from multiple ions

With default settings all chromatographic peaks present in the m/z slice were grouped into one feature. Signal from different ions would thus be treated as a single entity. Below we repeat the analysis with a strongly reduced value for bw.

#' Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = sampleData(data)$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp)
Correspondence analysis with reduced bw setting on a m/z slice containing signal from multiple ions.

Figure 18: Correspondence analysis with reduced bw setting on a m/z slice containing signal from multiple ions

Using a value of 1.8 for parameter bw, we successfully grouped the peaks into different features. We can now use these settings for the correspondence analysis on the full data set.

pdp <- PeakDensityParam(sampleGroups = sampleData(data)$group, bw = 1.8,
                        minFraction = 0.4, binSize = 0.02)

#' Perform the correspondence analysis
data <- groupChromPeaks(data, param = pdp) 

Next we evaluate the results from the correspondence analysis on a different m/z slice containing isomers leucine and isoleucine ([M+H]+ m/z 132.10191). Setting simulate = FALSE in plotChromPeakDensity will show the actual results from the correspondence analysis.

#' Plot the results for an m/z slice containing Leucine and Isoleucine
mzr <- 132.10191 + c(-0.01, 0.01)
chr <- chromatogram(data, mz = mzr, aggregationFun = "max")
plotChromPeakDensity(chr, simulate = FALSE)
Result of correspondence on a slice containing the isomers Leucine and Isoleucine.

Figure 19: Result of correspondence on a slice containing the isomers Leucine and Isoleucine

Despite being very close, chromatographic peaks of isomers were successfully grouped into separate features.

Results from the correspondence analysis can be accessed with the featureDefinition function. This function returns a data frame with the retention time and m/z ranges of the apex positions from the peaks assigned to the feature and their respective indices in the chromPeaks matrix.

#' Definition of the features
featureDefinitions(data) |> head()
##          mzmed    mzmin    mzmax     rtmed     rtmin     rtmax npeaks POOL
## FT001 105.0418 105.0417 105.0418 167.68597 167.48455 167.88740      2    2
## FT002 105.0415 105.0415 105.0415 157.71871 157.71871 157.71871      1    1
## FT003 105.0697 105.0691 105.0703  31.80794  31.68918  31.92670      2    2
## FT004 105.1103 105.1100 105.1105  63.75047  63.35239  64.14855      2    2
## FT005 105.4734 105.4732 105.4736 201.57593 201.36133 201.79053      2    2
## FT006 105.7166 105.7160 105.7172 181.21578 181.08901 181.34256      2    2
##        peakidx ms_level
## FT001 112, 396        1
## FT002      111        1
## FT003  19, 317        1
## FT004  48, 348        1
## FT005 260, 580        1
## FT006 135, 444        1

Note that the EIC chr above would also contain the correspondence results for the selected m/z range which could also be accessed with featureDefinitions(chr).

Also, we can calculate simple per-feature summary statistic with the featureSummary function. This function reports for each feature the total number and the percentage of samples in which a peak was detected and the total numbers and percentage of these samples in which more than one peak was assigned to the feature.

#' Per-feature summary.
featureSummary(data) |> head() 
##       count perc multi_count multi_perc        rsd
## FT001     2  100           0          0 0.23642129
## FT002     1   50           0          0         NA
## FT003     2  100           0          0 0.24525296
## FT004     2  100           0          0 0.04687951
## FT005     2  100           0          0 0.22330558
## FT006     2  100           0          0 0.05904323

The final result from the LC-MS data preprocessing is a matrix with feature abundances, rows being features, columns samples. Such a matrix can be extracted with the featureValues function from the result object (see further below for an alternative, preferred way to extract preprocessing results with the quantify method). The function takes two additional parameters value and method: value defines the column in the chromPeaks table that should be reported in the matrix, and method the approach to handle cases in which more than one peak in a sample is assigned to the feature.

Below we set value = "into" (the default) to extract the total integrated peak area and method = "maxint" to report the peak area of the peak with the largest intensity for features with multiple peaks in a sample.

#' feature intensity matrix
fmat <- featureValues(data, value = "into", method = "maxint")
head(fmat)
##       20171016_POOL_POS_1_105-134.mzML 20171016_POOL_POS_3_105-134.mzML
## FT001                        3202.7445                        2285.2830
## FT002                        3605.3915                               NA
## FT003                         744.8752                        1057.4312
## FT004                       18126.4603                       19369.4039
## FT005                       23243.6129                       31960.3709
## FT006                         671.5842                         617.7545

While we do have abundances reported for most features, we might also have missing values for some, like for feature FT002 in the second sample above. Such NAs occur if no chromatographic peak was assigned to a feature, either because peak detection failed, or because the corresponding ion is absent in the respective sample. One possibility to deal with such missing values is data imputation. With the fillChromPeaks function, xcms provides however an alternative approach that integrates the signal measured at the m/z - retention time region of the feature in the original files of samples for which an NA was reported hence filling-in missing peak data. Different approaches for this gap-filling are available (see ?fillChromPeals), with the method selected and configured with ChromPeakAreaParam being the preferred one: this approach defines the region from which the signal should be integrated based on the m/z and rt range of the identified chromatographic peaks of the feature.

Below we perform the gap-filling with the ChromPeakAreaParam approach.

#' Number of missing values
sum(is.na(fmat))
## [1] 135
data <- fillChromPeaks(data, param = ChromPeakAreaParam())

#' How many missing values after
sum(is.na(featureValues(data)))
## [1] 28
fmat_fld <- featureValues(data, value = "into", method = "maxint")
head(fmat_fld)
##       20171016_POOL_POS_1_105-134.mzML 20171016_POOL_POS_3_105-134.mzML
## FT001                        3202.7445                        2285.2830
## FT002                        3605.3915                        3183.9546
## FT003                         744.8752                        1057.4312
## FT004                       18126.4603                       19369.4039
## FT005                       23243.6129                       31960.3709
## FT006                         671.5842                         617.7545

With fillChromPeaks we could thus rescue signal for all but 23 features with missing values. Note that filled-in peak information can also be removed any time with the dropFilledChromPeaks function. Also, setting filled = FALSE in the featureValues function would return only data from detected peaks.

Note also that, in addition to the featureValues method that simply extracts the feature matrix, the preprocessing results can also be converted to a SummarizedExperiment that, besides containing the feature quantification, stores also the phenotype data (sample descriptions) and feature definitions. Below we use the quantify method to extract the preprocessing results as a SummarizedExperiment. The function takes any parameter of the featureValues function as additional arguments, thus, with the call below we extract all values (from detected and gap-filled peaks) and sum up the signals of all chromatographic peaks assigned to a feature in a sample.

library(SummarizedExperiment)
res <- quantify(data, method = "sum", value = "into")

The SummarizedExperiment is the main data object in Bioconductor to store quantified omics data and hence an ideal container for the LC-MS data preprocessing results. The column annotations can be accessed with colData, the feature definitions (row annotations) with rowData:

colData(res)
## DataFrame with 2 rows and 5 columns
##                                           file injection_idx      sample
##                                    <character>     <numeric> <character>
## 20171016_POOL_POS_1_105-134.mzML 20171016_P...             1      POOL_1
## 20171016_POOL_POS_3_105-134.mzML 20171016_P...            19      POOL_2
##                                        group spectraOrigin
##                                  <character>   <character>
## 20171016_POOL_POS_1_105-134.mzML        POOL /home/jo/P...
## 20171016_POOL_POS_3_105-134.mzML        POOL /home/jo/P...
rowData(res)
## DataFrame with 361 rows and 9 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001   105.042   105.042   105.042   167.686   167.485   167.887         2
## FT002   105.042   105.042   105.042   157.719   157.719   157.719         1
## ...         ...       ...       ...       ...       ...       ...       ...
## FT360   133.973   133.973   133.973   206.871   206.390   207.352         2
## FT361   133.973   133.973   133.973   200.248   200.248   200.248         1
##            POOL  ms_level
##       <numeric> <integer>
## FT001         2         1
## FT002         1         1
## ...         ...       ...
## FT360         2         1
## FT361         1         1

The quantified feature abundances can be accessed with the assay method providing with the second argument the name of the assay matrix to extract:

head(assay(res, "raw"))
##       20171016_POOL_POS_1_105-134.mzML 20171016_POOL_POS_3_105-134.mzML
## FT001                        3202.7445                        2285.2830
## FT002                        3605.3915                        3183.9546
## FT003                         744.8752                        1057.4312
## FT004                       18126.4603                       19369.4039
## FT005                       23243.6129                       31960.3709
## FT006                         671.5842                         617.7545

The SummarizedExperiment supports several such assay matrices (as long as they have the same dimensions). We can thus add for example the feature quantification excluding filled-in signal as an additional assay matrix:

assays(res)$raw_detected <- featureValues(data, method = "sum",
                                          value = "into", filled = FALSE)

With that we have now two assay matrices in our result object:

assayNames(res)
## [1] "raw"          "raw_detected"

Analogously we could add normalized data to this object or any other processed values. The advantage of having all the data within one object is obvious: any subsetting of the data ensures that all assays and annotations are subsetted correctly.

One final thing worth mentioning is that xcms result objects keep, next to the preprocessing results, also a history of all processing steps and all parameter objects used during the analysis. The process history can be accessed with the processHistory function.

#' Overview of the performed processings
processHistory(data)
## [[1]]
## Object of class "XProcessHistory"
##  type: Peak detection 
##  date: Thu May 25 08:48:48 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: CentWaveParam 
##  MS level(s) 1 
## 
## [[2]]
## Object of class "XProcessHistory"
##  type: Peak refinement 
##  date: Thu May 25 08:48:49 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: MergeNeighboringPeaksParam 
##  MS level(s) 1 
## 
## [[3]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Thu May 25 08:48:51 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[4]]
## Object of class "XProcessHistory"
##  type: Retention time correction 
##  date: Thu May 25 08:48:51 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakGroupsParam 
##  MS level(s) 1 
## 
## [[5]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Thu May 25 08:48:54 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[6]]
## Object of class "XProcessHistory"
##  type: Missing peak filling 
##  date: Thu May 25 08:48:55 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: ChromPeakAreaParam 
##  MS level(s) 1

The actual parameter object that was used to configure one particular analysis step can be accessed with processParam:

#' Access the parameter class for a processing step
processParam(processHistory(data)[[1]])
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 30
##  - peakwidth: [1]  2 10
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 2
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 0
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE

All this information is also stored in the metadata slot of the SummarizedExperiment:

metadata(res)
## [[1]]
## Object of class "XProcessHistory"
##  type: Peak detection 
##  date: Thu May 25 08:48:48 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: CentWaveParam 
##  MS level(s) 1 
## 
## [[2]]
## Object of class "XProcessHistory"
##  type: Peak refinement 
##  date: Thu May 25 08:48:49 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: MergeNeighboringPeaksParam 
##  MS level(s) 1 
## 
## [[3]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Thu May 25 08:48:51 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[4]]
## Object of class "XProcessHistory"
##  type: Retention time correction 
##  date: Thu May 25 08:48:51 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakGroupsParam 
##  MS level(s) 1 
## 
## [[5]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Thu May 25 08:48:54 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[6]]
## Object of class "XProcessHistory"
##  type: Missing peak filling 
##  date: Thu May 25 08:48:55 2023 
##  info:  
##  fileIndex: 1,2 
##  Parameter class: ChromPeakAreaParam 
##  MS level(s) 1

Analysis could now proceed with by e.g. normalizing the data to remove any technical variances in the feature abundances, feature grouping (compounding) using the MsFeatures package, differential abundance analysis and ultimately also annotation and identification of the LC-MS features (e.g. with help of the MetaboAnnotation package (Rainer et al. 2022)).

4 Bonus material - peak detection fun

In this section we apply the lessons learned from previous sections, in particular how to adapt peak detection setting on a rather noisy chromatographic data. Below we load the example data from a text file.

data <- read.table("data/chromatogram.txt", sep = "\t", header = TRUE)
head(data)
##    rt intensity
## 1 100         0
## 2 110         0
## 3 120         1
## 4 130         2
## 5 140         4
## 6 150         6

Our data has two columns, one with retention times and one with intensities. We can now create a Chromatogram object from that and plot the data.

chr <- Chromatogram(rtime = data$rt, intensity = data$intensity)
par(mar = c(2, 2, 0, 0))
plot(chr)

There are two peaks present in the data, with the signal from the latter being particularly noisy. The goal is now to perform the peak detection and to identify the two peaks. A first try with the default settings for centWave clearly shows that we have to tune the parameters (note that the setting of sn = 0 is required for the present data set as there are not enough background data points for the algorithm to estimate the noise level properly).

Which parameter would you now adapt to the data? What would be your choices? Go ahead and try different settings or setting combination to see if you can succeed in detecting the two peaks. Eventually you might even try a different peak detection algorithm (e.g. MatchedFilterParam).

xchr <- findChromPeaks(chr, param = CentWaveParam(sn = 0))
par(mar = c(2, 2, 0, 0))
plot(xchr)

With the default parameters centWave clearly failed to identify the two large peaks, defining only smaller fragments of them as potential peaks. Especially the second peak with its peculiar tri-forked shape seems to cause troubles. This would be even for a hydrophilic liquid interaction chromatography (HILIC), known to potentially result in noisy odd-shaped peaks, a rather unusual peak shape. In fact, the signal we were analyzing here is not of chromatographic origin:

Our example data represents a panorama picture featuring mountains from the Dolomites, the Paternkofel (left peak, colored red) and the famous Drei Zinnen (right tri-forked peak colored green).

5 Session information

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3          png_0.1-8                  
##  [3] SummarizedExperiment_1.31.1 GenomicRanges_1.53.1       
##  [5] GenomeInfoDb_1.37.1         IRanges_2.35.1             
##  [7] MatrixGenerics_1.13.0       matrixStats_0.63.0         
##  [9] MetaboCoreUtils_1.9.0       Spectra_1.11.2             
## [11] MsExperiment_1.3.0          xcms_3.99.3                
## [13] MSnbase_2.26.0              ProtGenerics_1.33.0        
## [15] S4Vectors_0.39.1            mzR_2.35.1                 
## [17] Rcpp_1.0.10                 Biobase_2.61.0             
## [19] BiocGenerics_0.47.0         BiocParallel_1.35.2        
## [21] knitr_1.42                  BiocStyle_2.29.0           
## [23] rmarkdown_2.21             
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                rlang_1.1.1                
##  [3] magrittr_2.0.3              clue_0.3-64                
##  [5] MassSpecWavelet_1.67.0      compiler_4.3.0             
##  [7] vctrs_0.6.2                 pkgconfig_2.0.3            
##  [9] crayon_1.5.2                fastmap_1.1.1              
## [11] magick_2.7.4                XVector_0.41.1             
## [13] utf8_1.2.3                  preprocessCore_1.63.1      
## [15] MultiAssayExperiment_1.27.0 xfun_0.39                  
## [17] zlibbioc_1.47.0             cachem_1.0.8               
## [19] jsonlite_1.8.4              progress_1.2.2             
## [21] highr_0.10                  DelayedArray_0.27.4        
## [23] prettyunits_1.1.1           parallel_4.3.0             
## [25] cluster_2.1.4               R6_2.5.1                   
## [27] bslib_0.4.2                 limma_3.57.1               
## [29] jquerylib_0.1.4             bookdown_0.34              
## [31] iterators_1.0.14            igraph_1.4.3               
## [33] Matrix_1.5-4.1              splines_4.3.0              
## [35] tidyselect_1.2.0            yaml_2.3.7                 
## [37] doParallel_1.0.17           codetools_0.2-19           
## [39] affy_1.78.0                 lattice_0.21-8             
## [41] tibble_3.2.1                plyr_1.8.8                 
## [43] evaluate_0.21               survival_3.5-5             
## [45] pillar_1.9.0                affyio_1.71.0              
## [47] BiocManager_1.30.20         foreach_1.5.2              
## [49] MALDIquant_1.22.1           ncdf4_1.21                 
## [51] generics_0.1.3              RCurl_1.98-1.12            
## [53] hms_1.1.3                   ggplot2_3.4.2              
## [55] munsell_0.5.0               scales_1.2.1               
## [57] glue_1.6.2                  lazyeval_0.2.2             
## [59] MsFeatures_1.9.0            tools_4.3.0                
## [61] mzID_1.39.0                 robustbase_0.95-1          
## [63] QFeatures_1.11.0            vsn_3.68.0                 
## [65] RANN_2.6.1                  fs_1.6.2                   
## [67] XML_3.99-0.14               grid_4.3.0                 
## [69] impute_1.75.1               MsCoreUtils_1.13.0         
## [71] colorspace_2.1-0            GenomeInfoDbData_1.2.10    
## [73] cli_3.6.1                   fansi_1.0.4                
## [75] S4Arrays_1.1.4              dplyr_1.1.2                
## [77] AnnotationFilter_1.25.0     pcaMethods_1.93.0          
## [79] gtable_0.3.3                DEoptimR_1.0-13            
## [81] sass_0.4.6                  digest_0.6.31              
## [83] SparseArray_1.1.6           htmltools_0.5.5            
## [85] multtest_2.57.0             lifecycle_1.0.3            
## [87] MASS_7.3-60

References

Conley, Christopher J, Rob Smith, Ralf J O Torgrip, Ryan M Taylor, Ralf Tautenhahn, and John T Prince. 2014. Massifquant: open-source Kalman filter-based XC-MS isotope trace feature detection. Bioinformatics 30 (18): 2636–43.
Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data.” Journal of Proteome Research, September. https://doi.org/10.1021/acs.jproteome.0c00313.
Katajamaa, Mikko, Jarkko Miettinen, and Matej Oresic. 2006. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22 (5): 634–36.
Prince, John T, and Edward M Marcotte. 2006. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical Chemistry 78 (17): 6140–52.
Rainer, Johannes, Andrea Vicini, Liesa Salzer, Jan Stanstrup, Josep M. Badia, Steffen Neumann, Michael A. Stravs, et al. 2022. “A Modular and Expandable Ecosystem for Metabolomics Data Annotation in R.” Metabolites 12 (2): 173. https://doi.org/10.3390/metabo12020173.
Smith, Colin A, Elizabeth J Want, Grace O’Maille, Ruben Abagyan, and Gary Siuzdak. 2006. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry 78 (3): 779–87.
Smith, Rob, Andrew D Mathis, Dan Ventura, and John T Prince. 2014. Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist’s point of view. BMC Bioinformatics 15 Suppl 7 (Suppl 7): S9.
Tautenhahn, Ralf, Christoph Böttcher, and Steffen Neumann. 2008. Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 9 (1): 504.